Software-defined datacenter network debugging
Software-defined Networking (SDN) enables flexible network management, but as networks
evolve toward a large number of end-points with diverse network policies, higher
speeds, and higher utilization, the abstraction SDN places over the network makes
monitoring and debugging network problems increasingly hard. While some problems
impact packet processing in the data plane (e.g., congestion), others cause policy
deployment failures (e.g., hardware bugs); both create inconsistency between operator
intent and actual network behavior. Existing debugging tools cannot
accurately detect, localize, and understand the root cause of problems observed in
large-scale networks; they either lack in-network resources (compute, memory, and/or
network bandwidth) or take a long time to debug network problems.
This thesis presents three debugging tools, PathDump, SwitchPointer, and Scout,
and a technique for tracing packet trajectories called CherryPick. We call for a different
approach to network monitoring and debugging: rather than implementing
debugging functionality entirely in-network, we should carefully partition the debugging
tasks between end-hosts and network elements. Toward this goal, we present
CherryPick, PathDump, and SwitchPointer. The core idea of CherryPick is to cherry-pick the
links that are key to representing a packet's end-to-end path, and to embed the picked
link IDs into the packet's header on its way to the destination.
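The cherry-picking step can be illustrated with a minimal sketch (the topology labels and the key-link predicate are hypothetical illustrations): in a fat-tree, the uplink choices determine the whole path, so only those links need to be embedded, and the downlinks can be inferred from the destination.

```python
# Hypothetical illustration of CherryPick's link selection.
# Each hop is (link_id, direction); only the "up" links
# (ToR->aggregation, aggregation->core) are embedded in the header.

def cherry_pick(path, is_key):
    """Return the subset of link IDs a packet would carry in its header."""
    return [hop for hop in path if is_key(hop)]

path = [("tor1->agg2", "up"), ("agg2->core1", "up"),
        ("core1->agg5", "down"), ("agg5->tor7", "down")]
picked = cherry_pick(path, lambda hop: hop[1] == "up")  # 2 of 4 links embedded
```

Embedding only the picked links keeps the header overhead small even on long paths.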
PathDump is an end-host-based network debugger built on tracing packet trajectories;
it exploits resources at the end-hosts to implement various monitoring and
debugging functionalities. PathDump currently runs over a real network comprising
only commodity hardware, and yet supports a surprisingly large class of network
debugging problems with minimal in-network functionality.
The key contribution of SwitchPointer is to efficiently provide network visibility
to end-host-based network debuggers like PathDump by using switch memory as a
"directory service" — each switch, rather than storing the telemetry data necessary for
debugging functionalities, stores pointers to the end hosts where the relevant telemetry
data is kept. This design choice of treating switch memory as a directory service makes
it possible to solve performance problems that were hard or infeasible with existing designs.
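The directory-service idea can be sketched minimally (the `SwitchDirectory` interface and per-epoch granularity are hypothetical illustrations, not SwitchPointer's exact design): the switch keeps, per time epoch, only the identities of the end hosts that hold the relevant telemetry.

```python
from collections import defaultdict

class SwitchDirectory:
    """Hypothetical sketch of a switch's 'directory service' memory:
    per time epoch, it stores only pointers (host IDs), not telemetry."""

    def __init__(self):
        self.epochs = defaultdict(set)

    def record(self, epoch, dst_host):
        # On forwarding a packet, remember *where* its telemetry will live
        # (the destination end-host), not the telemetry itself.
        self.epochs[epoch].add(dst_host)

    def query(self, epoch):
        # A debugger asks: which end hosts hold telemetry for this epoch?
        return sorted(self.epochs[epoch])
```

Because a set of host pointers is far smaller than the telemetry stream itself, this fits in scarce switch memory while still telling a debugger exactly where to look.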
Finally, we present and solve a network policy fault localization problem that arises
when operating policy management frameworks for a production network. We develop
Scout, a fully automated system that localizes faults in a large-scale policy deployment
and further pinpoints the physical-level failures that are the most likely cause of the
observed faults.
Implementing ChaCha based crypto primitives on programmable SmartNICs
Control and management plane applications such as serverless function orchestration and 4G/5G control plane functions are offloaded to SmartNICs to reduce communication and processing latency. Such applications involve multiple inter-host interactions that were traditionally secured using SSL/TLS gRPC-based communication channels. Offloading the applications to a SmartNIC implies that we must also offload the security algorithms; otherwise, we need to send the application messages to the host VM/container for crypto operations, negating the offload benefits. We propose crypto externs for Netronome Agilio SmartNICs that implement authentication and confidentiality (encryption/decryption) using the ChaCha stream cipher. AES and ChaCha are two popular cipher suites, but we chose ChaCha since none of the SmartNICs have ChaCha-based crypto accelerators. However, SmartNICs have a restricted instruction set and limited memory, making it difficult to implement security algorithms. This paper identifies and addresses several challenges in implementing ChaCha crypto primitives successfully. Our evaluations show that our crypto extern implementation satisfies the scalability requirements of popular applications such as serverless management functions and host in-band network telemetry. © 2022 ACM
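With no ChaCha accelerator available, the cipher must be built from the 32-bit adds, XORs, and rotates a SmartNIC does support. The core building block is the ChaCha quarter round (RFC 8439, Section 2.1); a Python reference sketch:

```python
MASK = 0xFFFFFFFF  # all arithmetic is on 32-bit words

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def quarter_round(a, b, c, d):
    # One ChaCha quarter round: only addition, XOR, and rotation,
    # which is why the cipher maps well onto restricted NIC ISAs.
    a = (a + b) & MASK; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK; b = rotl32(b ^ c, 7)
    return a, b, c, d
```

The full ChaCha20 block function applies this quarter round repeatedly over a 16-word state; RFC 8439 provides test vectors for validating an implementation.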
IoT MUD enforcement in the edge cloud using programmable switch
Targeted data breaches and cybersecurity attacks involving IoT devices are becoming ever more concerning. To combat these threats and risks, the IETF standardized Manufacturer Usage Description (MUD), which allows IoT device vendors to specify the intended communication patterns (MUD profile) of an IoT device. The MUD profile enables validating the actual communication pattern of an IoT device against the intended behavior at run time. However, the MUD specification was primarily intended for enforcement at the Local Area Network (LAN) of the IoT device, thus fragmenting the solution across multiple heterogeneous networks. MUD enforcement at higher levels in the network hierarchy (e.g., a private edge for enterprise networks) eases security policy management and reduces processing overheads on the existing security infrastructure. Realizing MUD enforcement at the edge raises two main challenges: (1) how to identify an IoT device at the edge so that its device-specific MUD profile can be enforced on its traffic, and (2) how to scale MUD enforcement to a large network of IoT devices. In this paper, we present our approach to addressing these challenges and validating IoT device communication at the edge. To scale MUD enforcement to a large IoT network, we leverage the multi-stage pipeline architecture and stateful ALUs of a P4 programmable switch and process IoT traffic in the data plane. © 2022 ACM
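The enforcement check itself can be sketched minimally (the simplified profile format and helper names are hypothetical; real MUD profiles per RFC 8520 are richer YANG/JSON documents, and the actual check runs in P4 match-action tables rather than software):

```python
# Hypothetical, simplified MUD profile and helpers for illustration.

def build_acl(mud_profile):
    """Flatten an allow-list profile into rules a switch could match on."""
    return {(r["dst"], r["proto"], r["port"]) for r in mud_profile["allowed"]}

def permit(acl, flow):
    """Default-deny check against the device's MUD-derived ACL."""
    return (flow["dst"], flow["proto"], flow["port"]) in acl

mud = {"allowed": [{"dst": "updates.vendor.example",
                    "proto": "tcp", "port": 443}]}
acl = build_acl(mud)
```

Mapping each (device, flow) tuple to a table lookup is what makes the check amenable to a multi-stage P4 pipeline: identification selects the device's ACL, and a later stage performs the match.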
Tracking P4 Program Execution in the Data Plane
While programmable switches provide operators with much-needed control over the network, they also increase the potential sources of packet-processing errors. Bugs can happen anywhere: in the P4 program, in the controller installing rules into tables, or in the compiler that maps the P4 program onto the resource-constrained switch pipelines. Most of these bugs manifest themselves only after certain sequences of packets with certain combinations of rules in the tables. Tracking each packet's execution path through the P4 program, i.e., the sequence of tables hit and the actions applied, directly in the data plane is useful for localizing such bugs as they occur in real time. The fact that programmable data planes require P4 programs to be loop-free and can perform simple integer arithmetic makes them amenable to Ball-Larus encoding, a well-known technique for profiling execution paths in software programs that can efficiently encode all N paths in a single ⌈log₂ N⌉-bit variable. However, for real-world P4 programs, the path variable can get quite large, making it inefficient for integer arithmetic at line rate. Moreover, the encoding could require a subset of tables that would otherwise have no data dependency to update the same variable. By carefully breaking up the P4 program into disjoint partitions and tracking each partition's execution path separately, we show how to minimally augment P4 programs to track the execution path of each packet.
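Ball-Larus numbering itself can be sketched compactly (a generic-DAG illustration in Python; the paper's contribution is making this work within switch-pipeline constraints): each node's outgoing edges get increments such that summing the increments along any entry-to-exit path yields a unique path ID.

```python
def ball_larus(dag, entry):
    """Assign Ball-Larus edge increments on a loop-free control-flow graph.

    dag maps each node to its list of successors; summing the returned
    increments along any entry-to-exit path gives a unique ID in [0, N).
    """
    num_paths, incr = {}, {}

    def count(v):
        if v in num_paths:
            return num_paths[v]
        succs = dag.get(v, [])
        if not succs:                 # exit node: exactly one path (itself)
            num_paths[v] = 1
            return 1
        total = 0
        for w in succs:
            incr[(v, w)] = total      # offset past all earlier subtrees
            total += count(w)
        num_paths[v] = total
        return total

    count(entry)
    return incr, num_paths[entry]
```

On a diamond-shaped graph (A branches to B and C, both rejoining at D), the two paths receive IDs 0 and 1, and only the edges leaving the branch node carry nonzero increments.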
Special Issue on The Workshop on Performance of host-based Network Applications (PerfNA 2022)
With the advancement of highly network-powered paradigms like 5G and microservices, which are typically deployed as containers/VMs, there is a growing imperative for host nodes to perform specialized network tasks like monitoring, filtering, tunneling, and load balancing. While traditionally these tasks were performed using switches and specialized middleboxes in the network, there is a demand to perform them on commodity hardware comprising COTS servers. However, a major challenge is to perform these tasks at low overhead and high reliability while maintaining low latency, high throughput, and flexibility.
A Case For Cross-Domain Observability to Debug Performance Issues in Microservices
Many applications deployed in the cloud are refactored into small components called microservices that are deployed as containers in a Kubernetes environment. Such applications run on a cluster of physical servers connected via the datacenter network. In such deployments, resources such as compute, memory, and network are shared, and hence some microservices (culprits) can misbehave and consume more than their share. This interference among applications hosted on the same node leads to performance issues (e.g., high latency, packet loss) in other microservices (victims), followed by delayed or low-quality responses. Given the highly distributed and transient nature of the workloads, it is extremely challenging to debug performance issues, especially with existing monitoring tools, which collect traces and analyze them at individual points (network, host, etc.) in a disaggregated manner. In this paper, we argue for a cross-domain (network and host) monitoring and debugging framework that provides end-to-end observability to debug application performance issues and pinpoint the root cause, whether it lies on the sender host, the receiver host, or in the network. We present the design and preliminary implementation details using eBPF (extended Berkeley Packet Filter) to elucidate the feasibility of the system. © 2022 IEEE
Anomaly Detection in Data Plane Systems using Packet Execution Paths
Programmable data planes provide exciting opportunities to realize fast, accurate, and data-driven control-loop decisions. Many data plane systems have been proposed for handling network dynamics (e.g., congestion, failures) in near real time. At the core of these systems are packet-processing data-plane algorithms that continuously monitor traffic and respond automatically. Despite their benefits, automatic responses to network events increase the potential sources of input and, hence, the attack surface. This paper takes a step toward securing such systems by (1) identifying possible attacks on recently proposed data-driven data-plane systems, and (2) designing a scalable tool for detecting such attacks at run time. Our approach models the plausible expected behavior and uses the model as a reference to check whether the system is under attack. We conduct preliminary experiments to demonstrate the feasibility of our detection methodology. © 2021 ACM
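The model-and-check idea can be sketched minimally (the `PathAnomalyDetector` interface and frequency threshold are hypothetical illustrations, not the paper's exact model): learn which packet execution paths occur during normal operation, then flag unseen or rare paths at run time.

```python
from collections import Counter

class PathAnomalyDetector:
    """Hypothetical sketch: a baseline over packet execution-path IDs,
    used at run time to flag paths that are unseen or rare."""

    def __init__(self, min_support=0.01):
        self.min_support = min_support  # assumed frequency threshold
        self.baseline = Counter()
        self.total = 0

    def train(self, path_ids):
        # Accumulate path-ID frequencies observed under normal operation.
        self.baseline.update(path_ids)
        self.total += len(path_ids)

    def is_anomalous(self, path_id):
        freq = self.baseline[path_id] / self.total if self.total else 0.0
        return freq < self.min_support
```

A real deployment would feed this from data-plane path tracking (e.g., Ball-Larus path IDs exported per packet) and would likely use a richer model than raw frequency.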